# Bump llama.cpp to b7356 #501
**Summary of Changes**

Hello @ilopezluna, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request integrates a significant update to the core llama.cpp library, bumping it to b7356.
Hey there - I've reviewed your changes - here's some feedback:
- In the speculative decoding acceptance loop, `SLT_DBG` logs `slot.drafted.size()` after `slot.drafted` has been cleared, so the logged accepted/total draft counts will always show 0 for the draft size; capture the original `n_draft` (already computed) or move the debug log before clearing `drafted`.
- The new `task_result_state` / `is_updated` plumbing assumes `server_task_result::update()` is always called before `to_json()`; if any existing or future call site bypasses `server_response_reader::next()` and calls `to_json()` directly, it will hit the assertion—consider either documenting this invariant clearly or providing a safer default path for non-streaming/non-HTTP usages.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- In the speculative decoding acceptance loop, `SLT_DBG` logs `slot.drafted.size()` after `slot.drafted` has been cleared, so the logged accepted/total draft counts will always show 0 for the draft size; capture the original `n_draft` (already computed) or move the debug log before clearing `drafted`.
- The new `task_result_state` / `is_updated` plumbing assumes `server_task_result::update()` is always called before `to_json()`; if any existing or future call site bypasses `server_response_reader::next()` and calls `to_json()` directly, it will hit the assertion—consider either documenting this invariant clearly or providing a safer default path for non-streaming/non-HTTP usages.
## Individual Comments
### Comment 1
<location> `llamacpp/native/src/server/server-context.cpp:2573-2582` </location>
<code_context>
// the accepted tokens from the speculation
- const auto ids = common_sampler_sample_and_accept_n(slot.smpl, ctx, draft);
+ const auto ids = common_sampler_sample_and_accept_n(slot.smpl, ctx, slot.i_batch_dft, slot.drafted);
+ slot.i_batch_dft.clear();
+ slot.drafted.clear();
</code_context>
<issue_to_address>
**issue (bug_risk):** Speculative decoding debug log uses `slot.drafted.size()` after the vector is cleared.
After `common_sampler_sample_and_accept_n` you clear `slot.drafted`, but the final `SLT_DBG` still uses `slot.drafted.size()` for the total draft count, so it will always log `0`. Capture the draft count before clearing (e.g., `const auto n_draft = slot.drafted.size();`) and log that, or move the log before `slot.drafted.clear()`.
</issue_to_address>
### Comment 2
<location> `llamacpp/native/src/server/server-context.cpp:429-437` </location>
<code_context>
return res;
}
+
+ void copy_state_to(server_slot & other) const {
+ llama_memory_seq_rm(llama_get_memory(ctx), other.id, 0, -1);
+ llama_memory_seq_cp(llama_get_memory(ctx), id, other.id, 0, -1);
+ other.n_decoded = n_decoded;
+ other.n_remaining = n_remaining;
+ other.i_batch = i_batch;
+ other.n_prompt_tokens_cache = n_prompt_tokens_cache;
+ other.n_prompt_tokens_processed = n_prompt_tokens_processed;
+ other.prompt = prompt.clone();
+ }
};
</code_context>
<issue_to_address>
**suggestion (bug_risk):** copy_state_to does not reset child-specific transient state, which can leak speculative/decoder bookkeeping to children.
When copying from a parent to a child in `copy_state_to`, you only copy KV/memory and a few counters, but leave per-generation transient fields on `other` (e.g., `drafted`, `i_batch_dft`, and similar speculative/decoder bookkeeping) unchanged. If the child slot was previously used, these fields may contain stale values and influence later speculative decoding or sampling. Please ensure all speculative/transient generation state on `other` is reset or reinitialized so children always start cleanly from the parent’s state.
Suggested implementation:
```cpp
void copy_state_to(server_slot & other) const {
// copy KV / memory state from this slot to the target slot
llama_memory_seq_rm(llama_get_memory(ctx), other.id, 0, -1);
llama_memory_seq_cp(llama_get_memory(ctx), id, other.id, 0, -1);
// copy persistent generation counters / prompt state
other.n_decoded = n_decoded;
other.n_remaining = n_remaining;
other.i_batch = i_batch;
other.n_prompt_tokens_cache = n_prompt_tokens_cache;
other.n_prompt_tokens_processed = n_prompt_tokens_processed;
other.prompt = prompt.clone();
// reset all transient / speculative decoding state on the child so it starts clean
// from the parent's state without inheriting stale bookkeeping from a previous use
other.drafted.clear();
other.i_batch_dft = 0;
other.n_draft_accepted = 0;
other.n_draft_total = 0;
other.draft_ratio = 0.0f;
other.has_draft = false;
other.draft_in_flight = false;
other.i_batch_sampling = 0;
other.sampled_tokens.clear();
}
```
The above implementation assumes the following transient/speculative members exist on `server_slot` (or its enclosing type):
- `std::vector<...> drafted;`
- `int i_batch_dft;`
- `int n_draft_accepted;`
- `int n_draft_total;`
- `float draft_ratio;`
- `bool has_draft;`
- `bool draft_in_flight;`
- `int i_batch_sampling;`
- `std::vector<...> sampled_tokens;`
If the actual names or types differ in your codebase, adjust the reset section accordingly to:
1. Clear all containers that hold speculative or per-draft data (e.g., `drafted`, `draft_tokens`, etc.).
2. Reset all speculative indices/counters (e.g., `i_batch_dft`, `n_draft_accepted`, `n_draft_total`, acceptance ratios).
3. Reset any flags that indicate draft/speculative activity (e.g., `has_draft`, `draft_in_flight`, `sampling_in_progress`).
Ensure that every field that is mutated only during a single generation/speculative pass is either:
- copied from the parent when it is actually part of the persistent state, or
- explicitly reset here so reused child slots never carry stale speculative state.
</issue_to_address>
```cpp
const auto ids = common_sampler_sample_and_accept_n(slot.smpl, ctx, slot.i_batch_dft, slot.drafted);
slot.i_batch_dft.clear();
slot.drafted.clear();

slot.n_decoded += ids.size();

slot.t_token_generation = std::max<int64_t>(1, t_current - slot.t_start_generation) / 1e3;

// update how many tokens out of those tested were accepted
slot.n_draft_accepted += ids.size() - 1;
```
**issue (bug_risk):** Speculative decoding debug log uses `slot.drafted.size()` after the vector is cleared.

After `common_sampler_sample_and_accept_n` you clear `slot.drafted`, but the final `SLT_DBG` still uses `slot.drafted.size()` for the total draft count, so it will always log `0`. Capture the draft count before clearing (e.g., `const auto n_draft = slot.drafted.size();`) and log that, or move the log before `slot.drafted.clear()`.
```cpp
void copy_state_to(server_slot & other) const {
    llama_memory_seq_rm(llama_get_memory(ctx), other.id, 0, -1);
    llama_memory_seq_cp(llama_get_memory(ctx), id, other.id, 0, -1);
    other.n_decoded = n_decoded;
    other.n_remaining = n_remaining;
    other.i_batch = i_batch;
    other.n_prompt_tokens_cache = n_prompt_tokens_cache;
    other.n_prompt_tokens_processed = n_prompt_tokens_processed;
    other.prompt = prompt.clone();
}
```
**suggestion (bug_risk):** `copy_state_to` does not reset child-specific transient state, which can leak speculative/decoder bookkeeping to children.

When copying from a parent to a child in `copy_state_to`, you only copy KV/memory and a few counters, but leave per-generation transient fields on `other` (e.g., `drafted`, `i_batch_dft`, and similar speculative/decoder bookkeeping) unchanged. If the child slot was previously used, these fields may contain stale values and influence later speculative decoding or sampling. Please ensure all speculative/transient generation state on `other` is reset or reinitialized so children always start cleanly from the parent's state.
Code Review
This pull request incorporates a significant update from the upstream llama.cpp repository, introducing several major features and refactorings. Key enhancements include support for generating multiple completions (n > 1) through a new parent-child task model, a substantial rework of the speculative decoding logic for improved performance and organization, and a cleaner state management design for chat messages. The changes also bring more flexible configuration for multi-model serving via presets and improve overall server robustness with better error handling. My review identified a minor logging bug and an opportunity for code simplification, but overall, the changes are of high quality and represent a solid step forward.
```cpp
for (auto it = map_idx_to_media.begin(); it != map_idx_to_media.end(); ++it) {
    size_t idx = it->first;
    const mtmd::input_chunk_ptr & chunk = it->second;
    res.map_idx_to_media[idx] = mtmd::input_chunk_ptr(mtmd_input_chunk_copy(chunk.get()));
}
```
For better readability and to use a more modern C++ approach, this iterator-based for loop can be replaced with a range-based for loop. This makes the code more concise and easier to understand.

```cpp
for (const auto & pair : map_idx_to_media) {
    size_t idx = pair.first;
    const mtmd::input_chunk_ptr & chunk = pair.second;
    res.map_idx_to_media[idx] = mtmd::input_chunk_ptr(mtmd_input_chunk_copy(chunk.get()));
}
```
```diff
 }

-SLT_DBG(slot, "accepted %d/%d draft tokens, new n_tokens = %d\n", (int) ids.size() - 1, (int) draft.size(), slot.prompt.n_tokens());
+SLT_DBG(slot, "accepted %d/%d draft tokens, new n_tokens = %d\n", (int) ids.size() - 1, (int) slot.drafted.size(), slot.prompt.n_tokens());
```
There appears to be a bug in this debug log. `slot.drafted.size()` is used to report the number of generated draft tokens, but `slot.drafted` is cleared on line 2575, causing this to always log `0`. The `n_draft` variable, which captured the size before clearing, should be used instead for accurate logging.

```cpp
SLT_DBG(slot, "accepted %d/%d draft tokens, new n_tokens = %d\n", (int) ids.size() - 1, (int) n_draft, slot.prompt.n_tokens());
```
No description provided.